Using Hadoop With R
===================

DBI is quite useful R package to pass the query to database and fetch back the result as data frame and take care of rest data manipulation work inside R.


Setup Hadoop on a Linux box, install hive, spark etc... 
Then use Rstudio server setup on the Linux side (web gui) to talk to Hadoop. 
Run SparkR package and the sparkRHive.init function. 
Boom. Now you are in Hive. 
Pull some hive data via SparkR into a local data frame from some HiveQL and then go wild. 
BTW... SparkR will let you write back to Hive too after you are finished. Good luck.


Use sparklyr package. Establish a connection and start doing miracles.


Take a look to RevoScaleR package, which is an R library that offers a set of functionalities for processing large datasets without having to load them all at once in the memory


On Windows:
1. Install Microsoft ODBC driver for Hive.
2. Configure your ODBC Hive data source.
3. Install and call R's RODBC library.
4. Open your connection with sqlConnect function. 
5. Fetch your data with sqlQuery function - you'll need basic SQL literacy here.


A lot of people are mentioning ODBC connectors, but I personally have found a lot of success with JDBC as well. The RJDBC package works for Hive interactive connection, but also RDBMS like teradata and oracle, plus postgres and Cassandra. 


https://www.r-bloggers.com/using-hadoop-with-r-it-depends/


Recommend the book of Prajapati (2013) Big Data analytics with R and Hadoop


I use Revolution Analytics (now its MicrosoftR) packages to connect RStudio with Hadoop Hive and HDFS.


What is the file that you want to process ? If it is a csv file, a json file, but a small file which is stored on Hadoop filse system, you can use hdfs (or hive, hbase ..) in order to get it on your filesystem from command line and then work on it with read_csv, Rjsonio ... H20 ... If your file is a real bigdata file (1 tera or above...), you will have to use mapreduce. You will have to install special stuffs like Rhadoop, Rhbase ... Rjava ... but that is really messy. You will need to get help from computer scientist. Hadoop provide a javastream for general purpose